Introduction

Analysis: DH-Stopping

Full Dataset

Linguistic Factors

Frequency

There appears to be an effect of frequency; below, we explore whether to include it in the model.

Frequency and Stress

Given the apparent importance of frequency, we looked for an effect of stress. We decided not to pursue stress as a linguistic factor, in part because of coding inconsistencies.

Lexical Type & Stress

We briefly explored whether the content vs. function word distinction is useful to keep in the model, but ultimately decided to use word position instead (see below).

Preceding Environment

Based on the plot below, we might want to consider D separately when coding preceding environment.

Here we have collapsed preceding environment groups, but kept D separate as a result of the above visualization.

Obstruents promote following stops. This can be viewed as assimilation: obstruents are the strongest of the classes, and stopping a fricative is a fortition, so obstruents condition stops.

With D, on the other hand, we see less of that assimilation. Stopping would produce a sequence of two identical segments, and it is difficult to parse a boundary between them, so speakers are more likely to still produce a fricative.

We looked at vowels just in case, but the error bars on the most frequently stopped environments are too wide for this comparison to be informative.

Word Position

As a proxy for lexical type (and even potentially frequency) we looked at whether word position might have an effect.

Based on model comparisons, we end up using word position instead of frequency. Word position results in a better model, and word position and frequency are largely collinear anyway.

Examining Collinearity of Word Position and Frequency

We include word position as the predictor in the model (see the plot). We note, though, that there are frequency differences, and when we test frequency separately it is significant.
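One simple way to quantify how strongly word position and frequency overlap is to regress frequency on position at the word level and look at the share of variance explained. The sketch below uses simulated data, not the study data. The `words` data frame, its two position levels, and the frequency values are all assumptions made for illustration; only the column names `word_pos` and `Lg10WF` are taken from the models in this document.

```r
set.seed(7)

# Simulated stand-in for a word-level table (31 word types, as in the models);
# the split into 16 initial and 15 medial items is hypothetical.
words <- data.frame(
  word_pos = factor(rep(c("initial", "medial"), times = c(16, 15)))
)

# Hypothetical log10 frequencies, higher on average for word-initial items
words$Lg10WF <- ifelse(words$word_pos == "initial",
                       rnorm(nrow(words), mean = 5, sd = 0.5),
                       rnorm(nrow(words), mean = 3, sd = 0.5))

# Mean frequency per position
aggregate(Lg10WF ~ word_pos, data = words, FUN = mean)

# Share of variance in frequency that position alone accounts for
pos_on_freq <- lm(Lg10WF ~ word_pos, data = words)
r2 <- summary(pos_on_freq)$r.squared
r2
```

An R-squared close to 1 would mean the two predictors carry nearly the same information, which is the situation in which including both in one model is hard to justify.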

Final Models For Linguistic Factors

pos_lm = lmer(binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
               data = voc_data)
summary(pos_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 |  
##     edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: -3024.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7076 -0.3502 -0.1937  0.0266  4.7956 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.003374 0.05809 
##  edit_word                 (Intercept) 0.000181 0.01345 
##  Residual                              0.044517 0.21099 
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                   Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)        0.02164    0.10712 9987.53119   0.202    0.840
## word_posinitial    0.05174    0.10713 9642.16226   0.483    0.629
## word_posmedial    -0.01579    0.10716 9649.73305  -0.147    0.883
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn
## word_posntl -0.998        
## word_posmdl -0.998  0.998

Frequency Model

Fixed Effects:

  • Word Frequency (Lg10WF)

Random Effects:

  • Speaker

  • Word

frq_lm = lmer(binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)
summary(frq_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 |  
##     edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: -2997.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7190 -0.3500 -0.1868  0.0200  4.7863 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0033797 0.05814 
##  edit_word                 (Intercept) 0.0007295 0.02701 
##  Residual                              0.0445248 0.21101 
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) -0.09871    0.03403 33.55768  -2.901 0.006528 ** 
## Lg10WF       0.03093    0.00745 30.19704   4.152 0.000249 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr)
## Lg10WF -0.978

ANOVA for Word Position and Frequency Models

## refitting model(s) with ML (instead of REML)
## Data: voc_data
## Models:
## frq_lm: binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## pos_lm: binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##        npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)    
## frq_lm    5 -3003.7 -2966.6 1506.8  -3013.7                         
## pos_lm    6 -3031.7 -2987.1 1521.8  -3043.7 29.976  1  4.374e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We chose word position because it resulted in a better model (lower AIC and BIC than the frequency model), and because frequency and position are collinear.
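The call that produced the comparison table is not shown in this document. Because the frequency and position models are not nested in one another, information criteria are the more natural basis for choosing between them. The following is a minimal sketch of that kind of comparison on simulated data, assuming the `lme4` package is installed. The `toy` data frame, its effect sizes, and the shortened column names `speaker` and `word` are hypothetical.

```r
library(lme4)
set.seed(1)

# Simulated tokens: 20 speakers x 10 words x 3 repetitions (all hypothetical)
toy <- expand.grid(speaker = factor(1:20), word = factor(1:10), rep = 1:3)

# Word-level predictors: position and a frequency that tracks position
word_pos_by_word <- rep(c("initial", "medial"), each = 5)
freq_by_word     <- c(runif(5, 4, 6), runif(5, 2, 4))
toy$word_pos <- factor(word_pos_by_word[as.integer(toy$word)])
toy$Lg10WF   <- freq_by_word[as.integer(toy$word)]

# Binary outcome with a speaker effect and a position effect
spk_eff <- rnorm(20, sd = 0.05)
p <- 0.05 + 0.10 * (toy$word_pos == "initial") + spk_eff[as.integer(toy$speaker)]
toy$binary_stop <- rbinom(nrow(toy), 1, pmin(pmax(p, 0.01), 0.99))

# Fit both models with ML (REML = FALSE) so their AIC values are comparable
frq_toy <- lmer(binary_stop ~ Lg10WF + (1 | speaker) + (1 | word),
                data = toy, REML = FALSE)
pos_toy <- lmer(binary_stop ~ word_pos + (1 | speaker) + (1 | word),
                data = toy, REML = FALSE)

aic_tab <- AIC(frq_toy, pos_toy)
aic_tab
```

The model with the lower AIC is preferred. On simulated data of this size a singular-fit message may appear, which does not affect the illustration.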

pos_preceding_lm = lmer(binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
               data = voc_data)
summary(pos_preceding_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: -3517
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0750 -0.4633 -0.1013  0.0831  4.8370 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 3.462e-03 0.058839
##  edit_word                 (Intercept) 8.605e-05 0.009277
##  Residual                              4.270e-02 0.206642
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)             4.832e-02  1.048e-01  1.124e+04   0.461 0.644890    
## word_posinitial         5.611e-03  1.046e-01  1.109e+04   0.054 0.957220    
## word_posmedial         -1.992e-02  1.045e-01  1.110e+04  -0.191 0.848842    
## preceding_catobstruent  9.225e-02  7.418e-03  2.319e+03  12.437  < 2e-16 ***
## preceding_catsonorant  -2.623e-02  7.588e-03  1.427e+03  -3.456 0.000564 ***
## preceding_catvowel     -2.268e-02  8.219e-03  2.577e+03  -2.759 0.005830 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts
## word_posntl -0.997                                      
## word_posmdl -0.995  0.998                               
## prcdng_ctbs -0.052  0.001   0.001                       
## prcdng_ctsn -0.055  0.004   0.001   0.714               
## prcdng_ctvw -0.080  0.035   0.004   0.651      0.669

We are filtering out frequency, but not including it in the model, because there is no variation in the low-frequency terms.

Social Factors

Race and Ethnicity

We began by looking at whether the proportion of dh-stopping differed by race/ethnicity.

Hispanic and Latinx speakers have a higher proportion of stop realizations than do white or Black speakers.

voc_data$descent = factor(voc_data$descent, ordered = FALSE)
voc_data$descent = relevel(voc_data$descent, "hispanic/latinx")

descent_lm = lmer(binary_stop ~ word_pos + preceding_cat + descent + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)

summary(descent_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## binary_stop ~ word_pos + preceding_cat + descent + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: -3507.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0813 -0.4616 -0.0984  0.0819  4.8369 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 3.425e-03 0.058525
##  edit_word                 (Intercept) 8.587e-05 0.009267
##  Residual                              4.270e-02 0.206630
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)             6.101e-02  1.050e-01  1.128e+04   0.581 0.561200    
## word_posinitial         3.744e-03  1.046e-01  1.109e+04   0.036 0.971440    
## word_posmedial         -2.193e-02  1.045e-01  1.110e+04  -0.210 0.833817    
## preceding_catobstruent  9.204e-02  7.417e-03  2.317e+03  12.409  < 2e-16 ***
## preceding_catsonorant  -2.631e-02  7.587e-03  1.425e+03  -3.467 0.000542 ***
## preceding_catvowel     -2.277e-02  8.218e-03  2.575e+03  -2.770 0.005641 ** 
## descentblack           -8.441e-03  1.417e-02  2.128e+02  -0.596 0.552038    
## descentwhite           -2.244e-02  9.935e-03  1.861e+02  -2.259 0.025073 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv dscntb
## word_posntl -0.996                                                        
## word_posmdl -0.994  0.998                                                 
## prcdng_ctbs -0.052  0.001   0.001                                         
## prcdng_ctsn -0.055  0.004   0.001   0.714                                 
## prcdng_ctvw -0.080  0.035   0.004   0.651      0.669                      
## descentblck -0.034  0.000   0.000   0.005      0.002     -0.001           
## descentwhit -0.055  0.008   0.008   0.012      0.005      0.004      0.384

There is evidence that Hispanic/Latinx speakers stop more than white speakers (p = 0.025), but no evidence that they stop more than Black speakers (p = 0.552). The latter is likely a result of low token counts and a relative lack of variation among the Black speakers.
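The `descentblack` and `descentwhite` coefficients in the summary above are differences from the reference level, which is why `relevel()` is called before fitting. A small base-R illustration on a toy vector (not the study data) shows how the reference level changes and which contrast columns result:

```r
# Toy vector with the three descent categories used in this analysis
descent <- factor(c("white", "black", "hispanic/latinx", "white", "black"))

# By default the first level alphabetically is the reference
default_ref <- levels(descent)[1]

# Make hispanic/latinx the baseline, as in the model above
descent <- relevel(descent, ref = "hispanic/latinx")
new_ref <- levels(descent)[1]

# The remaining levels become the contrast columns (one coefficient each)
contrast_names <- colnames(contrasts(descent))
contrast_names
```

With this coding, each descent coefficient compares one group directly against Hispanic/Latinx speakers; a comparison between Black and white speakers would need a different reference level or a post-hoc contrast.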

Birth Year

We looked at birth year for all groups:

Race and Birth Year

We also investigated whether or not the three racial categories showed changes over the last ~60 years.

It appears in this dataset that birth year is not a significant predictor of stop proportions.

Field Site

We also examined how dh-stopping proportions vary across the field sites in our dataset. We later focus in on Salinas, Bakersfield, and Sacramento because of relatively lower token counts in other locations.

Field Site by Race/Ethnicity

We also looked at the interaction between race and field site. Note that some field sites have groups that do not use stopping at all (e.g. only white speakers in Redlands show stopping behavior).

Bakersfield is the only community in which we see substantial amounts of stopping for all ethnic groups. We do not want to interpret the very high rate of stopping in HUM because it rests on a low token count. We can be reasonably confident in the rate of stopping for Hispanic speakers in SAL because that intersection of site and race is the best represented.

  • We will not interact race with field site, and we will not draw any conclusions about field site: there is no reliable effect of site beyond the effects of descent.

Gender

Next we turn to gender (operationalized binarily).

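Descriptively, there is no effect of gender for the Hispanic population, but there appears to be one for white speakers and possibly for Black speakers. In the model below, however, neither gender nor its interaction with descent is significant, so we do not keep gender in the final model.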

Model

gender_lm = lmer(binary_stop ~ word_pos + preceding_cat + descent*gender_binary + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)

summary(gender_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + descent * gender_binary +  
##     (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: -3493.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0833 -0.4633 -0.0969  0.0832  4.8478 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 3.421e-03 0.058487
##  edit_word                 (Intercept) 8.579e-05 0.009262
##  Residual                              4.269e-02 0.206619
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                               Estimate Std. Error         df t value Pr(>|t|)
## (Intercept)                  6.322e-02  1.052e-01  1.133e+04   0.601 0.547994
## word_posinitial              3.702e-03  1.046e-01  1.109e+04   0.035 0.971762
## word_posmedial              -2.186e-02  1.045e-01  1.110e+04  -0.209 0.834323
## preceding_catobstruent       9.216e-02  7.417e-03  2.316e+03  12.425  < 2e-16
## preceding_catsonorant       -2.628e-02  7.587e-03  1.424e+03  -3.463 0.000549
## preceding_catvowel          -2.280e-02  8.218e-03  2.574e+03  -2.774 0.005571
## descentblack                 1.113e-02  1.980e-02  2.438e+02   0.562 0.574609
## descentwhite                -1.628e-02  1.379e-02  1.855e+02  -1.181 0.239214
## gender_binarym              -4.442e-03  1.401e-02  1.786e+02  -0.317 0.751572
## descentblack:gender_binarym -3.926e-02  2.837e-02  2.050e+02  -1.384 0.167892
## descentwhite:gender_binarym -1.239e-02  1.991e-02  1.812e+02  -0.623 0.534343
##                                
## (Intercept)                    
## word_posinitial                
## word_posmedial                 
## preceding_catobstruent      ***
## preceding_catsonorant       ***
## preceding_catvowel          ** 
## descentblack                   
## descentwhite                   
## gender_binarym                 
## descentblack:gender_binarym    
## descentwhite:gender_binarym    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv dscntb
## word_posntl -0.994                                                        
## word_posmdl -0.992  0.998                                                 
## prcdng_ctbs -0.052  0.001   0.001                                         
## prcdng_ctsn -0.055  0.004   0.001   0.714                                 
## prcdng_ctvw -0.080  0.035   0.004   0.651      0.669                      
## descentblck -0.048  0.001   0.001   0.004     -0.002     -0.003           
## descentwhit -0.074  0.005   0.006   0.011      0.005      0.004      0.424
## gendr_bnrym -0.067  0.000   0.000  -0.005     -0.004      0.000      0.354
## dscntblck:_  0.033  0.000   0.000   0.000      0.004      0.003     -0.698
## dscntwht:g_  0.047  0.000   0.000  -0.003     -0.002     -0.001     -0.293
##             dscntw gndr_b dscntb:_
## word_posntl                       
## word_posmdl                       
## prcdng_ctbs                       
## prcdng_ctsn                       
## prcdng_ctvw                       
## descentblck                       
## descentwhit                       
## gendr_bnrym  0.509                
## dscntblck:_ -0.296 -0.494         
## dscntwht:g_ -0.692 -0.704  0.378

Education

voc_data %>% 
  group_by(education_cont) %>% 
  summarize(prop = mean(binary_stop),
            count = n(),
            CI.Low = ci.low(binary_stop),
            CI.High = ci.high(binary_stop),
            YMin = prop - CI.Low,
            YMax = prop + CI.High) %>% 
  ggplot(aes(x=education_cont,y=prop,alpha=count)) + 
  geom_bar(stat="identity") + 
  geom_errorbar(aes(ymin = YMin, ymax=YMax),
                position = "dodge",
                width=0.25) + 
  theme_minimal() + 
  labs(x="Education", y="Proportion of stop realizations", alpha = "Token Count") 
## Warning: Removed 1 rows containing missing values (`position_stack()`).

There is probably not enough data at education levels 1, 2, and 4 to draw reliable conclusions about a main effect of education.
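The plotting code above calls `ci.low()` and `ci.high()`, which are not defined in this section. Since the code computes `YMin = prop - CI.Low` and `YMax = prop + CI.High`, the helpers presumably return distances from the mean to the interval bounds. The following is a minimal sketch of stand-ins under that assumption, using a percentile bootstrap in base R. The bootstrap approach, the 1000 resamples, and the helper `boot_means` are assumptions, not the definitions actually used for the plots.

```r
# Bootstrap distribution of the mean (hypothetical helper)
boot_means <- function(x, n_boot = 1000) {
  replicate(n_boot, mean(sample(x, replace = TRUE)))
}

# Distance from the sample mean down to the 2.5th bootstrap percentile
ci.low <- function(x, n_boot = 1000) {
  mean(x) - quantile(boot_means(x, n_boot), 0.025, names = FALSE)
}

# Distance from the sample mean up to the 97.5th bootstrap percentile
ci.high <- function(x, n_boot = 1000) {
  quantile(boot_means(x, n_boot), 0.975, names = FALSE) - mean(x)
}

# Usage on simulated binary outcomes (not the study data)
set.seed(42)
stops <- rbinom(200, 1, 0.1)
prop  <- mean(stops)
bounds <- c(YMin = prop - ci.low(stops), prop = prop, YMax = prop + ci.high(stops))
bounds
```

With few tokens in a group, the resulting interval is wide, which is the pattern the error bars on the sparsely populated education levels reflect.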

Education and Race/Ethnicity

Just in case, we next look at education crossed with race/ethnicity.

There seems to be a pattern for Hispanic/Latinx speakers such that those who only completed high school show the highest proportion of dh-stopping. For Black speakers, there is probably not enough data to find a pattern, and the results are unclear for white speakers.

Model

voc_hispanic <- voc_data %>% 
  filter(descent == "hispanic/latinx") %>% 
  filter(education_cont %in% c(1,3))

educ_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
## boundary (singular) fit: see help('isSingular')
summary(educ_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + education_cont + (1 |  
##     speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: -167.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4182 -0.4917 -0.1202  0.0976  4.2883 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.003137 0.05601 
##  edit_word                 (Intercept) 0.000000 0.00000 
##  Residual                              0.054279 0.23298 
## Number of obs: 3954, groups:  speaker.of.DH_data_concat, 61; edit_word, 30
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.13545    0.03161   80.46229   4.285 5.02e-05 ***
## word_posmedial           -0.02923    0.01200 3928.10783  -2.435   0.0149 *  
## preceding_catobstruent    0.12465    0.01478 3915.02736   8.436  < 2e-16 ***
## preceding_catsonorant    -0.03181    0.01504 3914.88482  -2.115   0.0345 *  
## preceding_catvowel       -0.03894    0.01659 3915.09862  -2.347   0.0190 *  
## education_cont           -0.02451    0.01059   57.92026  -2.315   0.0242 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv
## word_posmdl  0.008                                        
## prcdng_ctbs -0.348 -0.002                                 
## prcdng_ctsn -0.348 -0.061  0.728                          
## prcdng_ctvw -0.316 -0.544  0.660      0.681               
## educatn_cnt -0.887 -0.008  0.002      0.009      0.009    
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

Numerically, the group with the highest rate of stopping is Hispanic speakers with a high school education. They differ significantly from the college group, but we do not have enough data to say whether they differ significantly from the some college/AS group. This comparison was done as a by-hand ordinal analysis, subsetting the data to two education levels at a time.

Hispanic Speakers

We now turn to just the Hispanic/Latinx subset of the data.

Birth Year

As we saw with the graph in the full dataset, there doesn’t appear to be an effect of birth year.

Gender

We looked at the effect of gender for just the Hispanic/Latinx population. There does not appear to be one.

Education

We also looked more closely at education for just the Latinx/Hispanic group.

Significant difference between high school and college:

voc_hispanic <- voc_data %>% 
  filter(descent == "hispanic/latinx") %>% 
  filter(education_cont %in% c(1,3))

educ_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)
## boundary (singular) fit: see help('isSingular')
summary(educ_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + education_cont + (1 |  
##     speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: -167.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4182 -0.4917 -0.1202  0.0976  4.2883 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.003137 0.05601 
##  edit_word                 (Intercept) 0.000000 0.00000 
##  Residual                              0.054279 0.23298 
## Number of obs: 3954, groups:  speaker.of.DH_data_concat, 61; edit_word, 30
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.13545    0.03161   80.46229   4.285 5.02e-05 ***
## word_posmedial           -0.02923    0.01200 3928.10783  -2.435   0.0149 *  
## preceding_catobstruent    0.12465    0.01478 3915.02736   8.436  < 2e-16 ***
## preceding_catsonorant    -0.03181    0.01504 3914.88482  -2.115   0.0345 *  
## preceding_catvowel       -0.03894    0.01659 3915.09862  -2.347   0.0190 *  
## education_cont           -0.02451    0.01059   57.92026  -2.315   0.0242 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv
## word_posmdl  0.008                                        
## prcdng_ctbs -0.348 -0.002                                 
## prcdng_ctsn -0.348 -0.061  0.728                          
## prcdng_ctvw -0.316 -0.544  0.660      0.681               
## educatn_cnt -0.887 -0.008  0.002      0.009      0.009    
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')

The same comparison for white speakers, between education levels 2 and 3:

voc_white <- voc_data %>% 
  filter(descent == "white") %>% 
  filter(education_cont %in% c(2,3))

educ_white_lm = lmer(binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_white)

summary(educ_white_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + education_cont + (1 |  
##     speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_white
## 
## REML criterion at convergence: -1261.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.4179 -0.3995 -0.1234  0.0570  4.9564 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0020868 0.04568 
##  edit_word                 (Intercept) 0.0002217 0.01489 
##  Residual                              0.0399591 0.19990 
## Number of obs: 3652, groups:  speaker.of.DH_data_concat, 56; edit_word, 30
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)            -5.469e-02  1.236e-01  2.280e+03  -0.442   0.6583    
## word_posinitial         2.035e-02  1.175e-01  3.150e+03   0.173   0.8625    
## word_posmedial         -1.109e-02  1.173e-01  3.150e+03  -0.095   0.9247    
## preceding_catobstruent  8.346e-02  1.290e-02  1.311e+03   6.470 1.38e-10 ***
## preceding_catsonorant  -2.196e-02  1.302e-02  9.315e+02  -1.687   0.0920 .  
## preceding_catvowel     -5.788e-03  1.417e-02  1.419e+03  -0.409   0.6829    
## education_cont          2.948e-02  1.434e-02  5.343e+01   2.055   0.0447 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv
## word_posntl -0.947                                                 
## word_posmdl -0.943  0.994                                          
## prcdng_ctbs -0.075  0.003   0.001                                  
## prcdng_ctsn -0.080  0.008   0.002   0.706                          
## prcdng_ctvw -0.118  0.056   0.008   0.642      0.668               
## educatn_cnt -0.296 -0.007  -0.007  -0.001     -0.006     -0.001

Bilingualism

Because there are bilingual speakers of English/Spanish in the Latinx/Hispanic group, we start to look at the potential effect of bilingualism on dh-stopping.

There does seem to be an effect of bilingualism, but we now check how bilingualism interacts with education:

Final Model:

voc_hispanic <- voc_data %>% 
  filter(descent == "hispanic/latinx") 

bilingual_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat +  spanish_bilingual*education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + spanish_bilingual *  
##     education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: -541.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9780 -0.5130 -0.1061  0.1051  4.4243 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 4.047e-03 0.063615
##  edit_word                 (Intercept) 6.683e-05 0.008175
##  Residual                              5.095e-02 0.225728
## Number of obs: 5374, groups:  speaker.of.DH_data_concat, 84; edit_word, 30
## 
## Fixed effects:
##                                       Estimate Std. Error         df t value
## (Intercept)                           -0.05561    0.07076   82.22597  -0.786
## word_posmedial                        -0.02436    0.01060  138.76869  -2.299
## preceding_catobstruent                 0.10952    0.01226 1355.24155   8.932
## preceding_catsonorant                 -0.02927    0.01257  866.74651  -2.328
## preceding_catvowel                    -0.03995    0.01376 1719.94883  -2.904
## spanish_bilingualyes                   0.19768    0.07538   78.95706   2.623
## education_cont                         0.03389    0.02231   78.84693   1.519
## spanish_bilingualyes:education_cont   -0.05839    0.02444   78.90459  -2.389
##                                     Pr(>|t|)    
## (Intercept)                          0.43419    
## word_posmedial                       0.02302 *  
## preceding_catobstruent               < 2e-16 ***
## preceding_catsonorant                0.02012 *  
## preceding_catvowel                   0.00373 ** 
## spanish_bilingualyes                 0.01047 *  
## education_cont                       0.13279    
## spanish_bilingualyes:education_cont  0.01929 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_
## word_posmdl -0.007                                                      
## prcdng_ctbs -0.128 -0.003                                               
## prcdng_ctsn -0.124 -0.056  0.722                                        
## prcdng_ctvw -0.107 -0.514  0.654      0.674                             
## spnsh_blngl -0.919  0.002  0.002     -0.003     -0.008                  
## educatn_cnt -0.970 -0.002  0.003      0.001     -0.004      0.911       
## spnsh_bln:_  0.886 -0.002 -0.006     -0.001      0.005     -0.974 -0.913

An interesting pattern emerges: bilinguals with only a high school education show the highest proportion of dh-stopping, whereas among monolingual speakers dh-stopping numerically increases with more, not less, education.
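The interaction can be read directly off the fixed-effect estimates in the summary above. The sketch below copies those four estimates and works out the education slope for each group and the predicted bilingual-monolingual gap at a given education level. The vector `b` and the function `pred_gap` are illustrative names, and treating level 1 as high school and level 3 as college is an assumption based on the subsetting used earlier.

```r
# Fixed-effect estimates copied from the bilingual_hisp_lm summary above
b <- c(intercept             = -0.05561,
       bilingual             =  0.19768,
       education             =  0.03389,
       bilingual_x_education = -0.05839)

# Change in stop proportion per one-step increase in education_cont
slope_monolingual <- b[["education"]]
slope_bilingual   <- b[["education"]] + b[["bilingual_x_education"]]

# Predicted difference (bilingual minus monolingual) at a given education level;
# the other predictors cancel out because they enter the model additively
pred_gap <- function(education) {
  b[["bilingual"]] + b[["bilingual_x_education"]] * education
}

c(monolingual = slope_monolingual, bilingual = slope_bilingual)
pred_gap(c(1, 3))
```

The bilingual slope works out to 0.03389 - 0.05839 = -0.0245, so the estimated education effect runs in opposite directions for the two groups, and the predicted gap between bilinguals and monolinguals shrinks from about 0.139 at level 1 to about 0.023 at level 3.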

Mixed Descent

We had originally coded mixed descent, but because the coding was ambiguous we decided not to explore it any further.

Field Site

We look here at the field sites, but only the Hispanic/Latinx speakers.

Note: this section still needs to be tidied up.

voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAC")

voc_hispanic <- voc_data %>%
  filter(site %in% fieldsub) %>% 
  filter(descent == "hispanic/latinx") 

bilingual_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat +  spanish_bilingual*education_cont + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + spanish_bilingual *  
##     education_cont + site + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: -424.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.0058 -0.5160 -0.1116  0.1138  4.4466 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0040973 0.06401 
##  edit_word                 (Intercept) 0.0001053 0.01026 
##  Residual                              0.0512832 0.22646 
## Number of obs: 4691, groups:  speaker.of.DH_data_concat, 74; edit_word, 29
## 
## Fixed effects:
##                                       Estimate Std. Error         df t value
## (Intercept)                           -0.09841    0.07613   70.14537  -1.293
## word_posmedial                        -0.02574    0.01164  134.03503  -2.212
## preceding_catobstruent                 0.11078    0.01323 1337.86948   8.374
## preceding_catsonorant                 -0.03045    0.01352  926.37066  -2.252
## preceding_catvowel                    -0.04087    0.01482 1696.68959  -2.757
## spanish_bilingualyes                   0.18212    0.07922   67.55428   2.299
## education_cont                         0.03683    0.02356   67.37238   1.563
## siteBAK                                0.04583    0.02649   67.25072   1.730
## siteSAL                                0.06609    0.02573   67.26394   2.568
## spanish_bilingualyes:education_cont   -0.05971    0.02575   67.45179  -2.319
##                                     Pr(>|t|)    
## (Intercept)                          0.20036    
## word_posmedial                       0.02863 *  
## preceding_catobstruent               < 2e-16 ***
## preceding_catsonorant                0.02458 *  
## preceding_catvowel                   0.00589 ** 
## spanish_bilingualyes                 0.02460 *  
## education_cont                       0.12269    
## siteBAK                              0.08825 .  
## siteSAL                              0.01244 *  
## spanish_bilingualyes:education_cont  0.02343 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_ sitBAK
## word_posmdl -0.010                                                             
## prcdng_ctbs -0.126 -0.004                                                      
## prcdng_ctsn -0.125 -0.054  0.726                                               
## prcdng_ctvw -0.108 -0.505  0.656      0.675                                    
## spnsh_blngl -0.874  0.000  0.001     -0.005     -0.007                         
## educatn_cnt -0.937 -0.003  0.003      0.001     -0.001      0.908              
## siteBAK     -0.225  0.009 -0.005      0.005     -0.004     -0.070 -0.010       
## siteSAL     -0.182  0.007 -0.006      0.003     -0.003     -0.090 -0.046  0.750
## spnsh_bln:_  0.849  0.000 -0.005      0.001      0.004     -0.970 -0.912  0.052
##             sitSAL
## word_posmdl       
## prcdng_ctbs       
## prcdng_ctsn       
## prcdng_ctvw       
## spnsh_blngl       
## educatn_cnt       
## siteBAK           
## siteSAL           
## spnsh_bln:_  0.035

Latinx

voc_data %>% 
  filter(descent == "hispanic/latinx") %>%
  filter(site %in% fieldsub) %>% 
  group_by(spanish_bilingual,education_cont,site) %>% 
  summarize(prop = mean(binary_stop),
            count = n(),
            CI.Low = ci.low(binary_stop),
            CI.High = ci.high(binary_stop),
            YMin = prop - CI.Low,
            YMax = prop + CI.High) %>% 
  ggplot(aes(x=education_cont,y=prop,alpha=count)) + 
  geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin = YMin, ymax=YMax),
                position = "dodge",
                width=0.25) + 
  theme_minimal() + 
  labs(x="Education", y="Proportion of stop realizations", alpha = "Token Count") + 
  facet_wrap(spanish_bilingual~site)
## `summarise()` has grouped output by 'spanish_bilingual', 'education_cont'. You
## can override using the `.groups` argument.
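
The `ci.low` and `ci.high` helpers behind the error bars are not defined in this section. Because the plotting code computes `YMin = prop - CI.Low` and `YMax = prop + CI.High`, they presumably return distances from the mean rather than interval endpoints. Below is a minimal base-R sketch under that assumption, using a percentile bootstrap. The names `boot_means`, `ci_low_sketch`, and `ci_high_sketch` are hypothetical, and the real helpers may differ.

```r
# Hypothetical stand-ins for ci.low()/ci.high(). ASSUMPTION: the real helpers
# return the distance from the sample mean to a bootstrap percentile bound.
boot_means <- function(x, n_boot = 1000) {
  # Resample x with replacement n_boot times and keep each resample's mean.
  replicate(n_boot, mean(sample(x, replace = TRUE)))
}

ci_low_sketch <- function(x, n_boot = 1000) {
  # Distance from the mean down to the 2.5th percentile of bootstrap means.
  mean(x) - quantile(boot_means(x, n_boot), 0.025, names = FALSE)
}

ci_high_sketch <- function(x, n_boot = 1000) {
  # Distance from the mean up to the 97.5th percentile of bootstrap means.
  quantile(boot_means(x, n_boot), 0.975, names = FALSE) - mean(x)
}

set.seed(1)
toy_stop <- c(rep(1, 20), rep(0, 80))    # toy data: 20% stop realizations
prop  <- mean(toy_stop)
y_min <- prop - ci_low_sketch(toy_stop)  # lower end of the error bar
y_max <- prop + ci_high_sketch(toy_stop) # upper end of the error bar
```

With helpers of this shape, small cells (low `count`) produce wide bars, which is worth keeping in mind when reading the faceted plot.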

Just Bakersfield and Salinas

BAKSAL <- c("BAK","SAL")
voc_data %>% 
  filter(descent == "hispanic/latinx") %>%
  filter(site %in% BAKSAL) %>% 
  group_by(education_cont,site) %>% 
  summarize(prop = mean(binary_stop),
            count = n(),
            CI.Low = ci.low(binary_stop),
            CI.High = ci.high(binary_stop),
            YMin = prop - CI.Low,
            YMax = prop + CI.High) %>% 
  ggplot(aes(x=education_cont,y=prop,alpha=count)) + 
  geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin = YMin, ymax=YMax),
                position = "dodge",
                width=0.25) + 
  theme_minimal() + 
  labs(x="Education", y="Proportion of stop realizations", alpha = "Token Count") + 
  facet_wrap(~site)
## `summarise()` has grouped output by 'education_cont'. You can override using
## the `.groups` argument.

Bakersfield is a mix of monolinguals and bilinguals, whereas the Salinas (SAL) results are driven by bilinguals.

voc_data %>% 
  filter(site %in% fieldsub) %>% 
  group_by(site,spanish_bilingual) %>% 
  summarize(participants = n_distinct(speaker.of.DH_data_concat))
## `summarise()` has grouped output by 'site'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups:   site [3]
##   site  spanish_bilingual participants
##   <fct> <chr>                    <int>
## 1 SAC   no                          42
## 2 SAC   yes                          5
## 3 BAK   no                          37
## 4 BAK   yes                         17
## 5 SAL   no                           8
## 6 SAL   yes                         31
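
The participant counts above can be turned into per-site shares of bilingual speakers, which makes the contrast between sites easier to see. The sketch below re-enters the printed counts by hand rather than reading them from `voc_data`. Note that the table above is not restricted to Latinx speakers.

```r
# Counts copied by hand from the tibble printed above.
participants <- data.frame(
  site              = c("SAC", "SAC", "BAK", "BAK", "SAL", "SAL"),
  spanish_bilingual = c("no", "yes", "no", "yes", "no", "yes"),
  participants      = c(42, 5, 37, 17, 8, 31)
)

# Cross-tabulate site by bilingual status, then take row-wise proportions.
counts <- xtabs(participants ~ site + spanish_bilingual, data = participants)
share_bilingual <- prop.table(counts, margin = 1)[, "yes"]

round(share_bilingual, 2)
```

On these counts the bilingual share is 17/54 in Bakersfield, 5/47 in Sacramento, and 31/39 in Salinas.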

voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAL")

voc_hispanic <- voc_data %>%
  filter(site %in% BAKSAL) %>% 
  filter(descent == "hispanic/latinx") 

bilingual_hisp_lm = lmer(binary_stop ~ word_pos + preceding_cat +  spanish_bilingual + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binary_stop ~ word_pos + preceding_cat + spanish_bilingual +  
##     site + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: 86.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9531 -0.5204 -0.1274  0.1154  4.1152 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0051335 0.07165 
##  edit_word                 (Intercept) 0.0001718 0.01311 
##  Residual                              0.0573700 0.23952 
## Number of obs: 4079, groups:  speaker.of.DH_data_concat, 64; edit_word, 29
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.08042    0.02596  101.81583   3.098  0.00252 ** 
## word_posmedial           -0.03042    0.01358  121.47691  -2.241  0.02687 *  
## preceding_catobstruent    0.11949    0.01515 1160.52527   7.886 7.13e-15 ***
## preceding_catsonorant    -0.03821    0.01559  833.16416  -2.451  0.01443 *  
## preceding_catvowel       -0.04775    0.01703 1540.71470  -2.804  0.00510 ** 
## spanish_bilingualyes      0.01110    0.02293   60.60032   0.484  0.63028    
## siteBAK                  -0.01461    0.02021   60.49472  -0.723  0.47248    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_
## word_posmdl -0.050                                               
## prcdng_ctbs -0.421 -0.005                                        
## prcdng_ctsn -0.412 -0.046  0.732                                 
## prcdng_ctvw -0.372 -0.496  0.662      0.674                      
## spnsh_blngl -0.713  0.008 -0.018     -0.023     -0.014           
## siteBAK     -0.440  0.006  0.002      0.001     -0.002      0.187

Salinas Only

voc_data %>% 
  filter(descent == "hispanic/latinx") %>%
  filter(site == "SAL") %>% 
  group_by(education_cont) %>% 
  summarize(prop = mean(binary_stop),
            count = n(),
            CI.Low = ci.low(binary_stop),
            CI.High = ci.high(binary_stop),
            YMin = prop - CI.Low,
            YMax = prop + CI.High) %>% 
  ggplot(aes(x=education_cont,y=prop,alpha=count)) + 
  geom_bar(stat="identity", position="dodge") + 
  geom_errorbar(aes(ymin = YMin, ymax=YMax),
                position = "dodge",
                width=0.25) + 
  theme_minimal() + 
  labs(x="Education Value", y="Proportion of stop realizations", alpha = "Token Count") + 
  scale_alpha(range = c(.3,.9))

Notes

We are keeping the version without mixed descent. We are not looking at mixed descent because its meaning for Latinx speakers is ambiguous.

ToDo:

  • Fix Shading for Alpha
  • Education in SAL
    • MODEL 4 (Salinas’ Version)

Important Findings:

  • MODEL 1:
    • Latinx speakers have more stopping
  • MODEL 2:
    • Education is a main effect only for Latinx speakers, between high school and college
    • Sacramento has a low rate among Latinx speakers
    • All field sites except SAC have fairly similar rates of stopping
  • MODEL 2B (white speakers model):
    • Education effect for white speakers between some college and finished college (overall uninterpretable pattern)
  • MODEL 3:
    • The bilingualism finding from the whole dataset does not replicate when we subset to just BAK and SAL

Analysis: Intermediates & Stops

voc_data %>% 
  group_by(realization) %>% 
  summarize(count = n())
## # A tibble: 4 × 2
##   realization  count
##   <chr>        <int>
## 1 deleted       1964
## 2 fricative     8612
## 3 intermediate  1185
## 4 stop           642

Full Dataset

Linguistic Factors

Lexical Type & Stress

Preceding Environment

Word Position

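To do (Bran): fix the reference level of word_pos here. In the summary below, both word_pos coefficients have standard errors of about 0.17 and are almost perfectly correlated with the intercept (-0.997), which suggests that the current reference level contains very few tokens.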
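
One way to address the reference-level issue is `relevel()`, which the notebook already uses for `site` and `descent`. The sketch below runs on a toy factor, not on `voc_data`. The sparse level name `"final"` is purely hypothetical, since the actual third level of `word_pos` is not shown in this output.

```r
# Toy factor standing in for voc_data$word_pos. The "final" level is a
# HYPOTHETICAL placeholder for whatever sparse level is currently the reference.
word_pos <- factor(c("final", "initial", "initial", "medial", "initial", "medial"))

levels(word_pos)[1]   # current reference level (first level alphabetically)
table(word_pos)       # token counts per level

# Make the most frequent level the reference so the intercept is well estimated.
most_frequent <- names(which.max(table(word_pos)))
word_pos_releveled <- relevel(word_pos, ref = most_frequent)

levels(word_pos_releveled)
```

Choosing the best-populated level as the baseline is only one option. Dropping the sparse level before fitting would be another.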

pos_lm = lmer(binned_binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
               data = voc_data)
summary(pos_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: 8175.1
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.92700 -0.56627 -0.30792  0.04076  3.14480 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.010117 0.10058 
##  edit_word                 (Intercept) 0.001306 0.03613 
##  Residual                              0.109403 0.33076 
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                  Estimate Std. Error        df t value Pr(>|t|)
## (Intercept)     3.288e-02  1.705e-01 4.939e+03   0.193    0.847
## word_posinitial 1.567e-01  1.707e-01 4.498e+03   0.918    0.359
## word_posmedial  1.765e-02  1.708e-01 4.546e+03   0.103    0.918
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn
## word_posntl -0.997        
## word_posmdl -0.997  0.996
frq_lm = lmer(binned_binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)
summary(frq_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: 8195.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9342 -0.5639 -0.3033  0.0357  3.1648 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.010121 0.10060 
##  edit_word                 (Intercept) 0.003731 0.06108 
##  Residual                              0.109380 0.33073 
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) -0.18028    0.07166 31.53935  -2.516 0.017178 *  
## Lg10WF       0.06733    0.01588 29.30850   4.241 0.000204 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##        (Intr)
## Lg10WF -0.980

ANOVA for Word Position and Frequency Models

## refitting model(s) with ML (instead of REML)
## Data: voc_data
## Models:
## frq_lm: binned_binary_stop ~ Lg10WF + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
## pos_lm: binned_binary_stop ~ word_pos + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##        npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## frq_lm    5 8192.7 8229.8 -4091.3   8182.7                         
## pos_lm    6 8171.7 8216.2 -4079.8   8159.7 22.999  1  1.621e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
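
The chi-square test above is a likelihood-ratio comparison, which strictly assumes nested models. `frq_lm` and `pos_lm` are not nested, so the AIC difference is arguably the safer summary. The sketch below only re-derives both quantities from the printed table. The values are copied by hand, so rounding differs slightly from the output.

```r
# Deviance and AIC copied by hand from the ANOVA table above.
dev_frq <- 8182.7; aic_frq <- 8192.7
dev_pos <- 8159.7; aic_pos <- 8171.7

# Likelihood-ratio statistic: drop in deviance for one extra parameter.
chisq   <- dev_frq - dev_pos
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

# AIC difference; positive values favour the word-position model.
delta_aic <- aic_frq - aic_pos

c(chisq = chisq, p_value = p_value, delta_aic = delta_aic)
```

Both summaries point the same way here: the word-position model has the lower deviance and the lower AIC.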

Examining Colinearity of Word Position and Frequency

pos_preceding_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) + (1 | edit_word),
                        data = voc_data)
summary(pos_preceding_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: 
## binned_binary_stop ~ word_pos + preceding_cat + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: 7484.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2011 -0.5610 -0.2281  0.0685  3.2510 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0102649 0.10132 
##  edit_word                 (Intercept) 0.0008984 0.02997 
##  Residual                              0.1032691 0.32136 
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.06671    0.16524 6121.43421   0.404   0.6864    
## word_posinitial           0.08435    0.16497 5644.30228   0.511   0.6091    
## word_posmedial            0.01107    0.16482 5687.37189   0.067   0.9465    
## preceding_catobstruent    0.17255    0.01192 5968.36530  14.470  < 2e-16 ***
## preceding_catsonorant    -0.05153    0.01234 3854.24265  -4.175 3.05e-05 ***
## preceding_catvowel       -0.02587    0.01322 5164.58738  -1.957   0.0504 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts
## word_posntl -0.996                                      
## word_posmdl -0.994  0.996                               
## prcdng_ctbs -0.054  0.002   0.001                       
## prcdng_ctsn -0.056  0.003   0.000   0.726               
## prcdng_ctvw -0.081  0.035   0.005   0.665      0.668

We filter by frequency but do not include it as a predictor in the model, because there is no variation among the low-frequency words.

Social Factors

Race and Ethnicity

voc_data$descent = factor(voc_data$descent, ordered = FALSE)
voc_data$descent = relevel(voc_data$descent, "hispanic/latinx")

descent_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + descent + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_data)

summary(descent_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + descent + (1 |  
##     speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_data
## 
## REML criterion at convergence: 7480.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2019 -0.5642 -0.2269  0.0709  3.2425 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0094287 0.09710 
##  edit_word                 (Intercept) 0.0008996 0.02999 
##  Residual                              0.1032652 0.32135 
## Number of obs: 12403, groups:  speaker.of.DH_data_concat, 189; edit_word, 31
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.10516    0.16550 6132.42131   0.635   0.5252    
## word_posinitial           0.08012    0.16496 5626.59708   0.486   0.6272    
## word_posmedial            0.00653    0.16482 5669.84573   0.040   0.9684    
## preceding_catobstruent    0.17192    0.01192 5964.57612  14.417  < 2e-16 ***
## preceding_catsonorant    -0.05180    0.01234 3850.13914  -4.197 2.77e-05 ***
## preceding_catvowel       -0.02617    0.01322 5159.51461  -1.980   0.0478 *  
## descentblack             -0.04212    0.02317  217.97809  -1.818   0.0705 .  
## descentwhite             -0.06626    0.01631  188.38856  -4.064 7.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv dscntb
## word_posntl -0.995                                                        
## word_posmdl -0.993  0.996                                                 
## prcdng_ctbs -0.055  0.002   0.001                                         
## prcdng_ctsn -0.056  0.003   0.000   0.726                                 
## prcdng_ctvw -0.082  0.035   0.005   0.665      0.668                      
## descentblck -0.035  0.000   0.000   0.004      0.002     -0.001           
## descentwhit -0.057  0.007   0.008   0.012      0.005      0.004      0.389

Birth Year

Birth Year and Race

Field Site

Field Site and Race

Gender & Race

voc_data %>% 
  select(c(descent, speaker.of.DH_data_concat)) %>% 
  unique() %>% 
  group_by(descent) %>% 
  summarize(count = n())
## # A tibble: 3 × 2
##   descent         count
##   <fct>           <int>
## 1 hispanic/latinx    84
## 2 black              25
## 3 white              81

We don’t see an interaction between race and gender

Education

voc_data %>% 
  group_by(education_cont) %>% 
  summarize(prop = mean(binned_binary_stop),
            count = n(),
            CI.Low = ci.low(binned_binary_stop),
            CI.High = ci.high(binned_binary_stop),
            YMin = prop - CI.Low,
            YMax = prop + CI.High) %>% 
  ggplot(aes(x=education_cont,y=prop,alpha=count)) + 
  geom_bar(stat="identity") + 
  geom_errorbar(aes(ymin = YMin, ymax=YMax),
                position = "dodge",
                width=0.25) + 
  theme_minimal() + 
  labs(x="Education", y="Proportion of stop realizations", alpha = "Token Count") 
## Warning: Removed 1 rows containing missing values (`position_stack()`).

Education & Race

Model

voc_hispanic <- voc_data %>% 
  filter(descent == "hispanic/latinx") %>% 
  filter(education_cont %in% c(1,3))

educ_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(educ_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + education_cont +  
##     (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: 3063.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.93785 -0.58209 -0.25503  0.08146  2.97118 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0103260 0.10162 
##  edit_word                 (Intercept) 0.0008229 0.02869 
##  Residual                              0.1220237 0.34932 
## Number of obs: 3954, groups:  speaker.of.DH_data_concat, 61; edit_word, 30
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.27355    0.05522   77.88480   4.954 4.14e-06 ***
## word_posmedial           -0.08700    0.02172   77.91077  -4.005 0.000141 ***
## preceding_catobstruent    0.19944    0.02330 1538.80420   8.560  < 2e-16 ***
## preceding_catsonorant    -0.05679    0.02402 1116.58293  -2.365 0.018223 *  
## preceding_catvowel       -0.05593    0.02595 1770.61216  -2.155 0.031306 *  
## education_cont           -0.02792    0.01852   57.02897  -1.507 0.137305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv
## word_posmdl -0.052                                        
## prcdng_ctbs -0.320 -0.009                                 
## prcdng_ctsn -0.321 -0.049  0.748                          
## prcdng_ctvw -0.290 -0.443  0.682      0.692               
## educatn_cnt -0.888 -0.007  0.002      0.008      0.008

Just Hispanic Speakers

Birth Year

Gender

Education

To look at!

Difference between high school and college (education_cont is not significant in this model, p = 0.137):

voc_hispanic <- voc_data %>% 
  filter(descent == "hispanic/latinx") %>% 
  filter(education_cont %in% c(1,3))

educ_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(educ_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + education_cont +  
##     (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: 3063.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -1.93785 -0.58209 -0.25503  0.08146  2.97118 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0103260 0.10162 
##  edit_word                 (Intercept) 0.0008229 0.02869 
##  Residual                              0.1220237 0.34932 
## Number of obs: 3954, groups:  speaker.of.DH_data_concat, 61; edit_word, 30
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.27355    0.05522   77.88480   4.954 4.14e-06 ***
## word_posmedial           -0.08700    0.02172   77.91077  -4.005 0.000141 ***
## preceding_catobstruent    0.19944    0.02330 1538.80420   8.560  < 2e-16 ***
## preceding_catsonorant    -0.05679    0.02402 1116.58293  -2.365 0.018223 *  
## preceding_catvowel       -0.05593    0.02595 1770.61216  -2.155 0.031306 *  
## education_cont           -0.02792    0.01852   57.02897  -1.507 0.137305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv
## word_posmdl -0.052                                        
## prcdng_ctbs -0.320 -0.009                                 
## prcdng_ctsn -0.321 -0.049  0.748                          
## prcdng_ctvw -0.290 -0.443  0.682      0.692               
## educatn_cnt -0.888 -0.007  0.002      0.008      0.008
voc_white <- voc_data %>% 
  filter(descent == "white") %>% 
  filter(education_cont %in% c(2,3))

educ_white_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat + education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_white)

summary(educ_white_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + education_cont +  
##     (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_white
## 
## REML criterion at convergence: 1794.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.5222 -0.5275 -0.2275  0.0390  3.2811 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.005563 0.07458 
##  edit_word                 (Intercept) 0.001261 0.03551 
##  Residual                              0.091991 0.30330 
## Number of obs: 3652, groups:  speaker.of.DH_data_concat, 56; edit_word, 30
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)              -0.09907    0.19076 1840.92025  -0.519   0.6036    
## word_posinitial           0.11549    0.18060 2278.09223   0.639   0.5226    
## word_posmedial            0.03756    0.18028 2281.14951   0.208   0.8350    
## preceding_catobstruent    0.16155    0.02007 1864.67149   8.049 1.48e-15 ***
## preceding_catsonorant    -0.04658    0.02040 1382.42445  -2.284   0.0225 *  
## preceding_catvowel        0.00015    0.02203 1842.64148   0.007   0.9946    
## education_cont            0.03812    0.02304   54.84947   1.654   0.1038    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_psn wrd_psm prcdng_ctb prcdng_cts prcdng_ctv
## word_posntl -0.942                                                 
## word_posmdl -0.938  0.993                                          
## prcdng_ctbs -0.077  0.003   0.001                                  
## prcdng_ctsn -0.081  0.008   0.002   0.716                          
## prcdng_ctvw -0.119  0.056   0.009   0.653      0.673               
## educatn_cnt -0.309 -0.006  -0.006  -0.002     -0.006     -0.002

Bilingualism

Final Model:

voc_hispanic <- voc_data %>% 
  filter(descent == "hispanic/latinx") 

bilingual_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat +  spanish_bilingual*education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual *  
##     education_cont + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: 4075.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1396 -0.5876 -0.2530  0.1164  3.0978 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0132811 0.11524 
##  edit_word                 (Intercept) 0.0007517 0.02742 
##  Residual                              0.1196839 0.34595 
## Number of obs: 5374, groups:  speaker.of.DH_data_concat, 84; edit_word, 30
## 
## Fixed effects:
##                                       Estimate Std. Error         df t value
## (Intercept)                            0.24732    0.12495   81.40889   1.979
## word_posmedial                        -0.07929    0.01910   65.60834  -4.152
## preceding_catobstruent                 0.19130    0.01946 1976.36198   9.833
## preceding_catsonorant                 -0.03999    0.02016 1319.58586  -1.984
## preceding_catvowel                    -0.04854    0.02173 2182.80497  -2.233
## spanish_bilingualyes                   0.03235    0.13314   78.26178   0.243
## education_cont                        -0.02002    0.03941   78.16766  -0.508
## spanish_bilingualyes:education_cont   -0.01360    0.04317   78.22478  -0.315
##                                     Pr(>|t|)    
## (Intercept)                           0.0511 .  
## word_posmedial                      9.72e-05 ***
## preceding_catobstruent               < 2e-16 ***
## preceding_catsonorant                 0.0475 *  
## preceding_catvowel                    0.0256 *  
## spanish_bilingualyes                  0.8087    
## education_cont                        0.6128    
## spanish_bilingualyes:education_cont   0.7537    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_
## word_posmdl -0.025                                                      
## prcdng_ctbs -0.117 -0.007                                               
## prcdng_ctsn -0.114 -0.045  0.736                                        
## prcdng_ctvw -0.098 -0.431  0.670      0.679                             
## spnsh_blngl -0.920  0.001  0.002     -0.001     -0.006                  
## educatn_cnt -0.971 -0.002  0.003      0.002     -0.003      0.911       
## spnsh_bln:_  0.886 -0.001 -0.005     -0.002      0.004     -0.974 -0.913

Education and Bilingualism

Bilingualism and Mixed Descent

Field Site

voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAC")

voc_hispanic <- voc_data %>%
  filter(site %in% fieldsub) %>% 
  filter(descent == "hispanic/latinx") 

bilingual_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat +  spanish_bilingual*education_cont + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual *  
##     education_cont + site + (1 | speaker.of.DH_data_concat) +  
##     (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: 3503
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1717 -0.6017 -0.2459  0.1503  3.1211 
## 
## Random effects:
##  Groups                    Name        Variance  Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.0125872 0.11219 
##  edit_word                 (Intercept) 0.0008417 0.02901 
##  Residual                              0.1179785 0.34348 
## Number of obs: 4691, groups:  speaker.of.DH_data_concat, 74; edit_word, 29
## 
## Fixed effects:
##                                       Estimate Std. Error         df t value
## (Intercept)                          1.438e-01  1.305e-01  6.915e+01   1.102
## word_posmedial                      -8.024e-02  2.031e-02  6.558e+01  -3.950
## preceding_catobstruent               1.968e-01  2.067e-02  1.817e+03   9.517
## preceding_catsonorant               -3.413e-02  2.130e-02  1.283e+03  -1.602
## preceding_catvowel                  -5.331e-02  2.308e-02  2.040e+03  -2.310
## spanish_bilingualyes                -5.414e-03  1.358e-01  6.663e+01  -0.040
## education_cont                      -1.449e-02  4.040e-02  6.647e+01  -0.359
## siteSAL                              1.452e-01  4.413e-02  6.645e+01   3.291
## siteBAK                              1.114e-01  4.543e-02  6.643e+01   2.451
## spanish_bilingualyes:education_cont -1.279e-02  4.414e-02  6.655e+01  -0.290
##                                     Pr(>|t|)    
## (Intercept)                         0.274200    
## word_posmedial                      0.000193 ***
## preceding_catobstruent               < 2e-16 ***
## preceding_catsonorant               0.109301    
## preceding_catvowel                  0.020986 *  
## spanish_bilingualyes                0.968320    
## education_cont                      0.720866    
## siteSAL                             0.001600 ** 
## siteBAK                             0.016874 *  
## spanish_bilingualyes:education_cont 0.772897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_ edctn_ sitSAL
## word_posmdl -0.026                                                             
## prcdng_ctbs -0.116 -0.007                                                      
## prcdng_ctsn -0.116 -0.045  0.738                                               
## prcdng_ctvw -0.100 -0.433  0.669      0.679                                    
## spnsh_blngl -0.875 -0.001  0.001     -0.003     -0.006                         
## educatn_cnt -0.937 -0.003  0.003      0.002     -0.002      0.908              
## siteSAL     -0.182  0.006 -0.006      0.002     -0.003     -0.089 -0.045       
## siteBAK     -0.225  0.007 -0.005      0.003     -0.003     -0.070 -0.010  0.750
## spnsh_bln:_  0.849  0.001 -0.005      0.000      0.004     -0.970 -0.912  0.034
##             sitBAK
## word_posmdl       
## prcdng_ctbs       
## prcdng_ctsn       
## prcdng_ctvw       
## spnsh_blngl       
## educatn_cnt       
## siteSAL           
## siteBAK           
## spnsh_bln:_  0.052

Education & Bilingualism

Salinas and Bakersfield Only

Bakersfield is a mix of monolinguals and bilinguals, whereas the Salinas (SAL) results are driven by bilinguals.

voc_data$site = factor(voc_data$site, ordered = FALSE)
voc_data$site = relevel(voc_data$site, "SAL")

voc_hispanic <- voc_data %>%
  filter(site %in% BAKSAL) %>% 
  filter(descent == "hispanic/latinx") 

bilingual_hisp_lm = lmer(binned_binary_stop ~ word_pos + preceding_cat +  spanish_bilingual + site + (1 | speaker.of.DH_data_concat) + (1 | edit_word), data = voc_hispanic)

summary(bilingual_hisp_lm)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: binned_binary_stop ~ word_pos + preceding_cat + spanish_bilingual +  
##     site + (1 | speaker.of.DH_data_concat) + (1 | edit_word)
##    Data: voc_hispanic
## 
## REML criterion at convergence: 3291.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1376 -0.5894 -0.2477  0.1547  3.0383 
## 
## Random effects:
##  Groups                    Name        Variance Std.Dev.
##  speaker.of.DH_data_concat (Intercept) 0.013893 0.11787 
##  edit_word                 (Intercept) 0.001193 0.03454 
##  Residual                              0.125345 0.35404 
## Number of obs: 4079, groups:  speaker.of.DH_data_concat, 64; edit_word, 29
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)               0.24657    0.04230  100.38812   5.829 6.78e-08 ***
## word_posmedial           -0.08757    0.02322   65.04878  -3.772 0.000353 ***
## preceding_catobstruent    0.21369    0.02310 1748.21599   9.249  < 2e-16 ***
## preceding_catsonorant    -0.03301    0.02395 1291.88573  -1.378 0.168360    
## preceding_catvowel       -0.05630    0.02584 2037.85154  -2.179 0.029429 *  
## spanish_bilingualyes     -0.04197    0.03716   59.68009  -1.129 0.263236    
## siteBAK                  -0.02645    0.03274   59.55654  -0.808 0.422519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) wrd_ps prcdng_ctb prcdng_cts prcdng_ctv spnsh_
## word_posmdl -0.107                                               
## prcdng_ctbs -0.399 -0.008                                        
## prcdng_ctsn -0.391 -0.038  0.745                                 
## prcdng_ctvw -0.353 -0.422  0.676      0.680                      
## spnsh_blngl -0.710  0.006 -0.017     -0.021     -0.013           
## siteBAK     -0.437  0.004  0.002      0.000     -0.002      0.186

Salinas Only

Things to do next time:

  • Investigate Gender
  • Build models for the intermediate realizations
  • Gender x Bilingualism?
  • Model 4 (Salinas’ Version)